Abstract. Given a set $P$ of $n$ uncertain points on the real line, each represented by its one-dimensional
probability density function, we consider the problem of building data structures on $P$ to answer range
queries of the following three types for any query interval $I$: (1) top-1 query: find the point in $P$ that
lies in $I$ with the highest probability; (2) top-$k$ query: given any integer $k \le n$ as part of the query,
return the $k$ points in $P$ that lie in $I$ with the highest probabilities; and (3) threshold query: given any
threshold $r$ as part of the query, return all points of $P$ that lie in $I$ with probabilities at least $r$. We
present data structures for these range queries with linear or nearly linear space and efficient query
time.


1 Introduction 

With the rapid growth in the number of application domains, such as data integration, information
extraction, sensor networks, and scientific measurements, where uncertain data are
generated at an unprecedented speed, managing, analyzing, and query processing over such
data has become a major challenge and has received significant attention. We study one
important problem in this domain: building data structures for uncertain data to answer
certain range queries efficiently. The problem has been studied extensively and has a wide range
of applications [3, 14, 28, 33, 36, 43, 44]. We formally define the problems below.

Let $\mathbb{R}$ be any real line (e.g., the $x$-axis). In the (traditional) deterministic version of this
problem, we are given a set $P$ of $n$ deterministic points on $\mathbb{R}$, and the goal is to build a data
structure (also called an “index” in databases) such that given a range, specified by an interval
$I \subseteq \mathbb{R}$, one point (or all points) in $I$ can be retrieved efficiently. It is well known that a
simple solution for this problem is a binary search tree over all points, which has linear size
and supports logarithmic (plus output size) query time. However, in many applications,
the location of each point may be uncertain and the uncertainty is represented in the form of 
probability distributions [4Tl6Hl4i431f44] . In particular, an uncertain point p is specified by its 
probability density function (pdf) $f_p : \mathbb{R} \to \mathbb{R}^+ \cup \{0\}$. Let $P$ be the set of $n$ uncertain points
in $\mathbb{R}$ (with pdfs specified as input). Our goal is to build data structures to quickly answer
range queries on $P$. In this paper, we consider the following three types of range queries,
each of which involves a query interval $I = [x_l, x_r]$. For any point $p \in P$, we use $\Pr[p \in I]$ to
denote the probability that $p$ is contained in $I$.

Top-1 query: Return the point $p$ of $P$ such that $\Pr[p \in I]$ is the largest.

* A preliminary version of this paper appeared in the Proceedings of the 25th International Symposium on Algorithms 
and Computation (ISAAC 2014). 
Fig. 1. The pdf of an uncertain point.


Fig. 2. The cdf of the uncertain point in Fig. 1.


Top-$k$ query: Given any integer $k$, $1 \le k \le n$, as part of the query, return the $k$ points $p$
of $P$ such that $\Pr[p \in I]$ are the largest.

Threshold query: Given a threshold $r$ as part of the query, return all points $p$ of $P$ such
that $\Pr[p \in I] \ge r$.

We assume $f_p$ is a step function, i.e., a histogram consisting of at most $c$ pieces (or
intervals) for some integer $c \ge 1$ (e.g., see Fig. 1). More specifically, $f_p(x) = y_i$ for
$x_{i-1} \le x < x_i$, $i = 1, \ldots, c$, with $x_0 = -\infty$, $x_c = \infty$, and $y_1 = y_c = 0$. Throughout the paper,
we assume $c$ is a constant. The cumulative distribution function (cdf) $F_p(x) = \int_{-\infty}^{x} f_p(t)\,dt$
is a monotone piecewise-linear function consisting of $c$ pieces (e.g., see Fig. 2). Note that
$F_p(+\infty) = 1$, and for any interval $I = [x_l, x_r]$ the probability $\Pr[p \in I]$ is $F_p(x_r) - F_p(x_l)$.
From a geometric point of view, each interval of $f_p$ defines a rectangle with the $x$-axis, and
the sum of the areas of all these rectangles of $f_p$ is exactly one. Further, the cdf value $F_p(x)$ is
the sum of the areas of the subsets of these rectangles to the left of the vertical line through $x$
(e.g., see Fig. 3), and the probability $\Pr[p \in I]$ is the sum of the areas of the subsets of these
rectangles between the two vertical lines through $x_l$ and $x_r$, respectively (e.g., see Fig. 4).
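
The histogram model can be made concrete with a minimal Python sketch. The function names `make_cdf` and `prob_in`, and the input layout (a list of finite breakpoints plus one pdf height per piece between them), are assumptions made for illustration only and do not appear in the paper. The sketch accumulates rectangle areas to obtain $F_p$ at the breakpoints and interpolates linearly inside a piece.

```python
from bisect import bisect_right

def make_cdf(xs, ys):
    """Build the cdf F_p of a histogram pdf (illustrative layout, not from the paper).

    xs: finite breakpoints in increasing order.
    ys: pdf heights, ys[i] on [xs[i], xs[i+1]); the pdf is 0 outside [xs[0], xs[-1]].
    """
    cum = [0.0]  # cum[i] = F_p(xs[i]), i.e. total rectangle area left of xs[i]
    for i, y in enumerate(ys):
        cum.append(cum[-1] + y * (xs[i + 1] - xs[i]))
    if abs(cum[-1] - 1.0) > 1e-9:
        raise ValueError("rectangle areas must sum to 1")

    def cdf(x):
        if x <= xs[0]:
            return 0.0
        if x >= xs[-1]:
            return 1.0
        i = bisect_right(xs, x) - 1          # piece containing x
        return cum[i] + ys[i] * (x - xs[i])  # linear inside the piece

    return cdf

def prob_in(cdf, x_l, x_r):
    """Pr[p in [x_l, x_r]] = F_p(x_r) - F_p(x_l)."""
    return cdf(x_r) - cdf(x_l)
```

For example, with breakpoints `[0, 1, 3]` and heights `[0.5, 0.25]`, the two rectangles have area 0.5 each, and `prob_in(F, 0.5, 2)` evaluates the area between the vertical lines through 0.5 and 2.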

As discussed in [3], the histogram model can be used to approximate most pdfs with 
arbitrary precision in practice. In addition, the discrete pdf where each uncertain point can 
appear in a few locations, each with a certain probability, can be viewed as a special case of 
the histogram model because we can use infinitesimal pieces around these locations. 

We also study an important special case where the pdf $f_p$ is a uniform distribution
function, i.e., $f_p$ is associated with an interval $[x_l(p), x_r(p)]$ such that $f_p(x) = 1/(x_r(p) - x_l(p))$
if $x \in [x_l(p), x_r(p)]$ and $f_p(x) = 0$ otherwise. Clearly, the cdf $F_p(x) = (x - x_l(p))/(x_r(p) - x_l(p))$
if $x \in [x_l(p), x_r(p)]$, $F_p(x) = 0$ if $x \in (-\infty, x_l(p))$, and $F_p(x) = 1$ if $x \in (x_r(p), +\infty)$.
Uniform distributions have been used as a major representation of uncertainty in some
previous work (e.g., [12, 14, 30]). We refer to this special case as the uniform case and to the more
general case where $f_p$ is a histogram distribution function as the histogram case.

Throughout the paper, we will always use $I = [x_l, x_r]$ to denote the query interval. The
query interval $I$ is unbounded if either $x_l = -\infty$ or $x_r = +\infty$ (otherwise, $I$ is bounded). For
the threshold query, we will always use $m$ to denote the output size of the query, i.e., the
number of points $p$ of $P$ such that $\Pr[p \in I] \ge r$.

Range reporting on uncertain data has many applications [3, 14, 28, 36, 43, 44]. As shown
in [3], our problems are also useful even in some applications that involve only deterministic
data. For example, consider the movie rating system in IMDB where each reviewer gives
a rating from 1 to 10. A top-$k$ query on $I = [7, +\infty)$ would find “the $k$ movies with the
largest percentages of ratings that are at least 7”; a threshold query on
$I = [7, +\infty)$ and $r = 0.85$ would find “all the movies such that at least 85% of the ratings
they receive are larger than or equal to 7”. Note that in the above examples the interval $I$
is unbounded, and thus, it would also be interesting to have data structures particularly for
quickly answering queries with unbounded query intervals.

Fig. 3. Geometrically, $F_p(x)$ is equal to the sum of the areas of the shaded rectangles.

Fig. 4. Geometrically, the probability $\Pr[p \in I]$ is equal to the sum of the areas of the shaded rectangles.


1.1 Previous Work 

The threshold query was first introduced by Cheng et al. [13]. Using R-trees, they [14] gave
heuristic algorithms for the histogram case, without any theoretical performance guarantees.
For the uniform case, if $r$ is fixed for any query, they proposed a data structure of $O(n r^{-1})$
size with $O(r^{-1} \log n + m)$ query time [14]. These bounds depend on $r^{-1}$, which can be
arbitrarily large.

Agarwal et al. [3] made a significant theoretical step on solving the threshold queries for
the histogram case: if the threshold $r$ is fixed, their approach builds an $O(n)$ size data
structure in $O(n \log n)$ time, with $O(m + \log n)$ query time; if $r$ is not fixed, they built an
$O(n \log^2 n)$ size data structure in $O(n \log^3 n)$ expected time that can answer each query in
$O(m + \log^3 n)$ time. Tao et al. [43, 44] considered the threshold queries in two and higher
dimensions. They provided heuristic results, and a query takes $O(n)$ time in the worst case.
Heuristic solutions were also given elsewhere, e.g., [28, 36, 40]. Recently, Abdullah et al. [1]
extended the notion of geometric coresets to uncertain data for range queries in order to
obtain efficient approximate solutions.

Our work falls into the broad area of managing and analyzing uncertain data, which
has recently attracted significant attention in the database community. This line of work
spans a range of issues, from the theoretical foundations of data models and data languages
and algorithmic problems for efficiently answering various queries to system implementation
issues. Probabilistic database systems have emerged as a major platform for this purpose, and
several prototype systems have been built to address different aspects and challenges in managing
probabilistic data, e.g., MYSTIQ [18], Trio [6], ORION [13], MayBMS [29], PrDB [39], and
MCDB [25]. Besides the range queries we mentioned above, there has also been much
work on efficiently answering different types of queries over probabilistic data, such as conjunctive
queries (or the union of conjunctive queries) [18, 19], aggregates [2BI3S], top-$k$ and
ranking [l6|30l3T837l4T], clustering jEEI], nearest neighbors PH2l35], and so on. We refer
interested readers to the recent book [42] for more information.
Four Problem Variations           Top-1 Queries    Top-k Queries       Threshold Queries

Uniform     Preprocessing Time    O(n log n)       O(n log n)          O(n log n)
Unbounded   Space                 O(n)             O(n)                O(n)
            Query Time            O(log n)         O(log n + k)        O(log n + m)

Histogram   Preprocessing Time    O(n log n)       O(n log n)          O(n log n)
Unbounded   Space                 O(n)             O(n)                O(n)
            Query Time            O(log n)         T                   O(log n + m)

Uniform     Preprocessing Time    O(n log n)       O(n log^2 n)        O(n log^2 n)
Bounded     Space                 O(n)             O(n log n)          O(n log n)
            Query Time            O(log n)         T                   O(log n + m)

Histogram   Preprocessing Time    O(n log^3 n)     O(n log^3 n)*       O(n log^3 n)* [3]
Bounded     Space                 O(n log^2 n)     O(n log^2 n)        O(n log^2 n) [3]
            Query Time            O(log^3 n)       O(log^3 n + k)      O(log^3 n + m) [3]


Table 1. Summary of our results (the result for threshold queries of the histogram bounded case is from [3]): T is
O(k) if k = Ω(log n log log n) and O(log n + k log k) otherwise. For threshold queries, m is the output size of each
query. All time complexities are deterministic except the preprocessing times for top-k and threshold queries of the
histogram bounded case (marked with *).


As discussed in [3], our uncertain model is an analogue of the attribute-level uncertainty
model in the probabilistic database literature. Another popular model is the tuple-level
uncertainty model |6H8E5], where a tuple has fixed attribute values but its existence is uncertain.
The range query under the latter model is much easier since a $d$-dimensional range searching
problem over uncertain data can be transformed to a $(d + 1)$-dimensional range searching problem
over certain data [3, 45]. In contrast, the problem under the former model is more challenging,
partly because it is unclear how to transform it to an instance on certain data.

1.2 Our Results 

Based on the above discussion, the problem has four variations: the uniform unbounded case,
where each pdf $f_p$ is a uniform distribution function and each query interval $I$ is unbounded;
the uniform bounded case, where each pdf $f_p$ is a uniform distribution function and each
query interval $I$ is bounded; the histogram unbounded case, where each pdf $f_p$ is a general
histogram distribution function and each query interval $I$ is unbounded; and the histogram
bounded case, where each pdf $f_p$ is a general histogram distribution function and each query
interval $I$ is bounded. Refer to Table 1 for a summary of our results on the four cases.

Note that we also present solutions to the most general case (i.e., the histogram bounded 
case), which were originally left as open problems in the preliminary version of this paper 

m 

We say the complexity of a data structure is $O(A, B)$ if it can be built in $O(A)$ time and
its size is $O(B)$.

- For the uniform unbounded case, the complexities of our data structures for the three
types of queries are all $O(n \log n, n)$. The top-1 query time is $O(\log n)$; the top-$k$ query
time is $O(\log n + k)$; the threshold query time is $O(\log n + m)$.

- For the histogram unbounded case, our results are the same as the above uniform
unbounded case except that the time for each top-$k$ query is $O(k)$ if $k = \Omega(\log n \log\log n)$
and $O(\log n + k \log k)$ otherwise (i.e., for large $k$, the algorithm has a better performance).

- For the uniform bounded case, the complexity of our top-1 data structure is $O(n \log n, n)$,
with query time $O(\log n)$. For the other two types of queries, the complexities of our data
structures are both $O(n \log^2 n, n \log n)$; the top-$k$ query time is $O(k)$ if $k = \Omega(\log n \log\log n)$
and $O(\log n + k \log k)$ otherwise, and the threshold query time is $O(\log n + m)$.

- For the histogram bounded case, for threshold queries, Agarwal et al. [3] built a data
structure of size $O(n \log^2 n)$ in $O(n \log^3 n)$ expected time, with $O(\log^3 n + m)$ query
time. Note that our results on the threshold queries for the two uniform cases and the
histogram unbounded case are clearly better than the above solution in [3]. For top-1
queries, we build a data structure of $O(n \log^2 n)$ size in $O(n \log^3 n)$ (deterministic) time,
with $O(\log^3 n)$ query time. For top-$k$ queries, we build a data structure of $O(n \log^2 n)$
size in $O(n \log^3 n)$ expected time, with $O(\log^3 n + k)$ query time.

Note that all the above results are based on the assumption that $c$ is a constant; otherwise
these results still hold with $n$ replaced by $c \cdot n$, except that for the histogram bounded case
the results hold with $n$ replaced by $c^2 n$.

The rest of the paper is organized as follows. We first introduce the notation and some
observations in Section 2. We present our results for the uniform case in Section 3. The
histogram case is discussed in Section 4. We conclude the paper in Section 5.

2 Preliminaries 

Recall that an uncertain point $p$ is specified by its pdf $f_p : \mathbb{R} \to \mathbb{R}^+ \cup \{0\}$, and the
corresponding cdf $F_p(x) = \int_{-\infty}^{x} f_p(t)\,dt$ is a monotone piecewise-linear function (with at most
$c$ pieces). For each uncertain point $p$, we call $\Pr[p \in I]$ the $I$-probability of $p$. Let $\mathcal{F}$ be the
set of the cdfs of all points of $P$. Since each cdf is an increasing piecewise-linear function,
depending on the context, $\mathcal{F}$ may also refer to the set of the $O(n)$ line segments of all cdfs.
Recall that $I = [x_l, x_r]$ is the query interval. We start with an easy observation.

Lemma 1. If $x_l = -\infty$, then for any uncertain point $p$, $\Pr[p \in I] = F_p(x_r)$.

Proof. Since $x_l = -\infty$, $\Pr[p \in I] = \int_{-\infty}^{x_r} f_p(t)\,dt$, which is exactly $F_p(x_r)$. □

Let $L$ be the vertical line with $x$-coordinate $x_r$. Since each cdf $F_p$ is a monotonically
increasing function, there is only one intersection between $F_p$ and $L$. It is easy to see that
for each cdf $F_p$ of $\mathcal{F}$, the $y$-coordinate of the intersection of $F_p$ and $L$ is $F_p(x_r)$, which, when
$x_l = -\infty$, is the $I$-probability of $p$ by Lemma 1. For each point in any cdf of $\mathcal{F}$, we call its
$y$-coordinate the height of the point.

In the uniform case, each cdf $F_p$ has three segments: the leftmost one is a horizontal
segment with two endpoints $(-\infty, 0)$ and $(x_l(p), 0)$, the middle one, whose slope is
$1/(x_r(p) - x_l(p))$, has two endpoints $(x_l(p), 0)$ and $(x_r(p), 1)$, and the rightmost one is a horizontal
segment with two endpoints $(x_r(p), 1)$ and $(+\infty, 1)$. We transform each $F_p$ to the line $l_p$
containing the middle segment of $F_p$. Consider an unbounded interval $I$ with $x_l = -\infty$. We
can use $l_p$ to compute $\Pr[p \in I]$ in the following way. Suppose the height of the intersection
of $L$ and $l_p$ is $y$. Then, $\Pr[p \in I] = 0$ if $y < 0$, $\Pr[p \in I] = y$ if $0 \le y \le 1$, and $\Pr[p \in I] = 1$
if $y > 1$. Therefore, once we know $l_p \cap L$, we can obtain $\Pr[p \in I]$ in constant time. Hence,
we can use $l_p$ instead of $F_p$ to determine the $I$-probability of $p$. The advantage of using $l_p$
is that lines are usually easier to deal with than line segments. Below, with a little abuse of
notation, for the uniform case we simply use $F_p$ to denote the line $l_p$ for any $p \in P$, and now
$\mathcal{F}$ is a set of lines.
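
The transformation from a uniform cdf to a line, and the constant-time recovery of the probability by clamping, can be sketched in Python as follows. The names `line_of` and `prob_unbounded` are hypothetical and chosen only for this illustration; a line is stored as a (slope, intercept) pair.

```python
def line_of(xl_p, xr_p):
    """Line l_p through (xl_p, 0) and (xr_p, 1), returned as (slope, intercept)."""
    slope = 1.0 / (xr_p - xl_p)
    return slope, -xl_p * slope

def prob_unbounded(line, x_r):
    """Pr[p in (-inf, x_r]]: height of l_p at x_r, clamped to [0, 1]."""
    slope, intercept = line
    y = slope * x_r + intercept
    return min(1.0, max(0.0, y))
```

For a point uniformly distributed on $[2, 6]$, the line has slope 0.25 and passes through $(2, 0)$; queries with $x_r$ to the left of 2 or to the right of 6 are clamped to 0 and 1, respectively.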

Fix the query interval $I = [x_l, x_r]$. For each $i$, $1 \le i \le n$, denote by $p_i$ the point of $P$
whose $I$-probability is the $i$-th largest. Based on the above discussion, we obtain Lemma 2,
which holds for both the histogram and uniform cases.

Lemma 2. If $x_l = -\infty$, then for each $1 \le i \le n$, $p_i$ is the point of $P$ such that $L \cap F_{p_i}$ is
the $i$-th highest among the intersections of $L$ and all cdfs of $\mathcal{F}$. □

Suppose $x_l = -\infty$. Based on Lemma 2, to answer the top-1 query on $I$, it is sufficient to
find the cdf of $\mathcal{F}$ whose intersection with $L$ is the highest; to answer the top-$k$ query, it is
sufficient to find the $k$ cdfs of $\mathcal{F}$ whose intersections with $L$ are the highest; to answer the
threshold query on $I$ and $r$, it is sufficient to find the cdfs of $\mathcal{F}$ whose intersections with $L$
have $y$-coordinates $\ge r$.

Half-plane range reporting: As the half-plane range reporting data structure [11] is
important for our later developments, we briefly discuss it in the dual setting. Let $S$ be a set
of $n$ lines. Given any point $q$, the goal is to report all lines of $S$ that are above $q$. An $O(n)$-size
data structure can be built in $O(n \log n)$ time that can answer each query in $O(\log n + m')$
time, where $m'$ is the number of lines above the query point $q$ [11]. The data structure can
be built as follows.

Let $U_S$ be the upper envelope of $S$ (e.g., see Fig. 5). We represent $U_S$ as an array of lines
$l_1, l_2, \ldots, l_h$ ordered as they appear on $U_S$ from left to right. For each line $l_i$, $l_{i-1}$ is its left
neighbor and $l_{i+1}$ is its right neighbor. We partition $S$ into a sequence $L_1(S), L_2(S), \ldots$ of
subsets, called layers (e.g., see Fig. 5). The first layer $L_1(S) \subseteq S$ consists of the lines that
appear on $U_S$. For $i > 1$, $L_i(S)$ consists of the lines that appear on the upper envelope of the
lines in $S \setminus \bigcup_{j=1}^{i-1} L_j(S)$. Each layer $L_i(S)$ is represented in the same way as $U_S$. To answer a
half-plane range reporting query on a point $q$, let $l(q)$ be the vertical line through $q$. We first
determine the line $l_i$ of $L_1(S)$ whose intersection with $l(q)$ is on the upper envelope of $L_1(S)$,
by doing binary search on the array of lines of $L_1(S)$. Then, starting from $l_i$, we walk on the
upper envelope of $L_1(S)$ in both directions to report the lines of $L_1(S)$ above the point $q$, in
linear time with respect to the output size. Next, we find the line of $L_2(S)$ whose intersection
with $l(q)$ is on the upper envelope of $L_2(S)$. We use the same procedure as for $L_1(S)$ to report
the lines of $L_2(S)$ above $q$. Similarly, we continue on the layers $L_3(S), L_4(S), \ldots$, until no
line is reported in a certain layer. By using fractional cascading [9], after determining the
line $l_i$ of $L_1(S)$ in $O(\log n)$ time by binary search, the data structure [11] can report all lines
above $q$ in constant time each.
Fig. 5. Partitioning $S$ into three layers: $L_1(S) = \{1, 2, 3\}$, $L_2(S) = \{4, 5, 6\}$, $L_3(S) = \{7, 8\}$. The thick polygonal
chain is the upper envelope of $S$.

For any vertical line $l$, for each layer $L_i(S)$, denote by $l_i(l)$ the line of $L_i(S)$ whose
intersection with $l$ is on the upper envelope of $L_i(S)$. By fractional cascading [9], we have
the following lemma for the data structure [11].

Lemma 3. [9, 11] For any vertical line $l$, after the line $l_1(l)$ is known, we can obtain the
lines $l_2(l), l_3(l), \ldots$ in this order in $O(1)$ time each. □
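
The layer decomposition and the reporting query described above can be illustrated with a simplified Python sketch. It is not the data structure of [11]: the layers are built by naive repeated peeling (quadratic in the worst case), and each layer is searched with its own binary search instead of fractional cascading, so a query costs $O(\log n)$ per visited layer rather than $O(1)$. The function names, the (slope, intercept) representation, and the assumptions that the lines are distinct and have integer coefficients are all choices made for this illustration.

```python
from bisect import bisect_right
from fractions import Fraction

def upper_envelope(lines):
    """Lines (slope, intercept) on the upper envelope, ordered left to right."""
    best = {}
    for a, b in lines:                       # per slope, only the top line can appear
        if a not in best or b > best[a]:
            best[a] = b
    hull = []
    for a, b in sorted(best.items()):
        while len(hull) >= 2:
            (a1, b1), (a2, b2) = hull[-2], hull[-1]
            # hull[-1] is hidden if the new line overtakes hull[-2] no later than hull[-1] does
            if (b1 - b) * (a2 - a1) <= (b1 - b2) * (a - a1):
                hull.pop()
            else:
                break
        hull.append((a, b))
    return hull

def build_layers(lines):
    """Peel upper envelopes repeatedly; returns a list of (hull, breakpoints)."""
    remaining, layers = list(lines), []
    while remaining:
        hull = upper_envelope(remaining)
        on_hull = set(hull)
        remaining = [l for l in remaining if l not in on_hull]
        # breaks[i] is the x-coordinate where hull[i] hands over to hull[i + 1]
        breaks = [Fraction(b1 - b2, a2 - a1) for (a1, b1), (a2, b2) in zip(hull, hull[1:])]
        layers.append((hull, breaks))
    return layers

def lines_above(layers, qx, qy):
    """Report all lines whose height at x = qx is at least qy."""
    out = []
    for hull, breaks in layers:
        j = bisect_right(breaks, qx)         # line of this layer on the envelope at qx
        if hull[j][0] * qx + hull[j][1] < qy:
            break                            # deeper layers lie below this envelope
        out.append(hull[j])
        for step in (-1, 1):                 # walk left, then right, from hull[j]
            i = j + step
            while 0 <= i < len(hull) and hull[i][0] * qx + hull[i][1] >= qy:
                out.append(hull[i])
                i += step
    return out
```

The walk in both directions is valid because, within one layer, the heights at a fixed $x$ decrease as one moves away from the envelope line in either direction, which is the same monotonicity used in the proof of Lemma 6.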


3 The Uniform Distribution 

In this section, we present our results for the uniform case. We first discuss our data structures
for the unbounded case in Section 3.1, which will also be needed in our data structures for
the bounded case in Section 3.2. Further, the results in Section 3.1 will also be useful in our
data structures for the histogram case in Section 4.
Recall that in the uniform case $\mathcal{F}$ is a set of lines.

3.1 Queries with Unbounded Intervals 

We first discuss the unbounded case where $I = [x_l, x_r]$ is unbounded; some techniques
introduced here will also be used later for the bounded case. Without loss of generality, we
assume $x_l = -\infty$; the other case where $x_r = +\infty$ can be solved similarly. Recall that $L$
is the vertical line with $x$-coordinate $x_r$.

For top-1 queries, by Lemma 2, we only need to maintain the upper envelope of $\mathcal{F}$,
which can be computed in $O(n \log n)$ time and $O(n)$ space. For each query, it is sufficient to
determine the intersection of $L$ with the upper envelope of $\mathcal{F}$, which can be done in $O(\log n)$
time.

Next, we consider top-$k$ queries.

Given $I$ and $k$, by Lemma 2, it suffices to find the $k$ lines of $\mathcal{F}$ whose intersections with $L$
are the highest, and we let $\mathcal{F}_k$ denote the set of these $k$ lines. As preprocessing, we build
the half-plane range reporting data structure (see Section 2) on $\mathcal{F}$, in $O(n \log n)$ time and
$O(n)$ space. Suppose the layers of $\mathcal{F}$ are $L_1(\mathcal{F}), L_2(\mathcal{F}), \ldots$. In the sequel, we compute the
set $\mathcal{F}_k$. Let the lines in $\mathcal{F}_k$ be $l^1, l^2, \ldots, l^k$, ordered from top to bottom by their intersections
with $L$.

Let $l_i(L)$ be the line of $L_i(\mathcal{F})$ which intersects $L$ on the upper envelope of the layer
$L_i(\mathcal{F})$, for $i = 1, 2, \ldots$. We first compute $l_1(L)$ in $O(\log n)$ time by binary search on the
upper envelope of $L_1(\mathcal{F})$. Clearly, $l^1$ is $l_1(L)$. Next, we determine $l^2$. Let the set $H$ consist
of the following three lines: $l_2(L)$, the left neighbor (if any) of $l_1(L)$ in $L_1(\mathcal{F})$, and the right
neighbor (if any) of $l_1(L)$ in $L_1(\mathcal{F})$.

Lemma 4. $l^2$ is the line in $H$ whose intersection with $L$ is the highest.

Proof. Note that $l^2$ is the line of $\mathcal{F} \setminus \{l^1\}$ whose intersection with $L$ is the highest. We
distinguish two cases:

1. If $l^2$ is in $L_1(\mathcal{F})$, since the slopes of the lines of $L_1(\mathcal{F})$ from left to right are increasing, $l^2$
must be a neighbor of $l^1$. Hence, $l^2$ must be either the left neighbor or the right neighbor
of $l^1$ in $L_1(\mathcal{F})$.

2. If $l^2$ is not in $L_1(\mathcal{F})$, then $l^2$ must be the line of $\mathcal{F} \setminus L_1(\mathcal{F})$ whose intersection with $L$
is the highest. According to the definition of the layers of $\mathcal{F}$, the upper envelope of $L_2(\mathcal{F})$
is also the upper envelope of $\mathcal{F} \setminus L_1(\mathcal{F})$. Therefore, $l_2(L)$ is the line of $\mathcal{F} \setminus L_1(\mathcal{F})$ whose
intersection with $L$ is the highest. Hence, $l^2$ must be $l_2(L)$.

The lemma thus follows. □

We refer to $H$ as the candidate set. By Lemma 4, we find $l^2$ in $H$ in $O(1)$ time. We
remove $l^2$ from $H$, and below we insert at most three lines into $H$ such that $l^3$ must be in $H$.
Specifically, if $l^2$ is $l_2(L)$, we insert the following three lines into $H$: $l_3(L)$, the left neighbor
of $l_2(L)$, and the right neighbor of $l_2(L)$. If $l^2$ is the left (resp., right) neighbor $l$ of $l_1(L)$, we
insert the left (resp., right) neighbor of $l$ in $L_1(\mathcal{F})$ into $H$. By generalizing Lemma 4, we can
show $l^3$ must be in $H$ (the details are omitted). We repeat the same algorithm until we find
$l^k$. To facilitate the implementation, we use a heap to store the lines of $H$, whose “keys” in
the heap are the heights of the intersections of $L$ and the lines of $H$.

Lemma 5. The set $\mathcal{F}_k$ can be found in $O(\log n + k \log k)$ time.

Proof. According to our algorithm, there are $O(k)$ insertions and “Extract-Max” operations
(i.e., finding the element of $H$ with the largest key and removing the element from $H$) on the
heap $H$. The size of $H$ is always bounded by $O(k)$ during the algorithm. Hence, all operations
on $H$ take $O(k \log k)$ time. Further, after finding $l_1(L)$ in $O(\log n)$ time, due to Lemma 3,
the lines that are inserted into $H$ can be found in constant time each. Hence, the total time
for finding $\mathcal{F}_k$ is $O(k \log k + \log n)$. □
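
The candidate-set algorithm behind Lemma 5 can be sketched in Python as follows. This is an illustration under simplifying assumptions rather than the paper's implementation: the layers are built by naive peeling, the envelope line $l_i(L)$ of each layer is located by a separate binary search instead of fractional cascading (so the sketch runs in $O(k \log n)$ time rather than $O(\log n + k \log k)$), the lines are assumed distinct with integer coefficients, and all function names are hypothetical.

```python
import heapq
from bisect import bisect_right
from fractions import Fraction

def upper_envelope(lines):
    """Lines (slope, intercept) on the upper envelope, ordered left to right."""
    best = {}
    for a, b in lines:
        if a not in best or b > best[a]:
            best[a] = b
    hull = []
    for a, b in sorted(best.items()):
        while len(hull) >= 2:
            (a1, b1), (a2, b2) = hull[-2], hull[-1]
            if (b1 - b) * (a2 - a1) <= (b1 - b2) * (a - a1):
                hull.pop()                   # hull[-1] never appears on the envelope
            else:
                break
        hull.append((a, b))
    return hull

def build_layers(lines):
    """Peel upper envelopes repeatedly; returns a list of (hull, breakpoints)."""
    remaining, layers = list(lines), []
    while remaining:
        hull = upper_envelope(remaining)
        on_hull = set(hull)
        remaining = [l for l in remaining if l not in on_hull]
        breaks = [Fraction(b1 - b2, a2 - a1) for (a1, b1), (a2, b2) in zip(hull, hull[1:])]
        layers.append((hull, breaks))
    return layers

def top_k(layers, x, k):
    """The k lines with the highest intersections with the vertical line at x."""
    heap, out = [], []

    def push(i, j, direction):
        # direction 0: envelope line l_i(L); -1 / +1: reached by walking left / right
        if 0 <= j < len(layers[i][0]):
            a, b = layers[i][0][j]
            heapq.heappush(heap, (-(a * x + b), i, j, direction))

    if layers:
        push(0, bisect_right(layers[0][1], x), 0)
    while heap and len(out) < k:
        _, i, j, direction = heapq.heappop(heap)   # Extract-Max on the candidate set H
        out.append(layers[i][0][j])
        if direction == 0:
            if i + 1 < len(layers):                # envelope line of the next layer
                push(i + 1, bisect_right(layers[i + 1][1], x), 0)
            push(i, j - 1, -1)                     # left neighbor in the same layer
            push(i, j + 1, 1)                      # right neighbor in the same layer
        else:
            push(i, j + direction, direction)      # keep walking in the same direction
    return out
```

Each extracted line inserts at most three new candidates, so the heap holds $O(k)$ entries throughout, matching the bound on the size of $H$ used in the proof.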

We can improve the algorithm to $O(\log n + k)$ time by using the selection algorithm in
[23] for sorted arrays. The key idea is that we can implicitly obtain $2k$ sorted arrays of $O(k)$
size each and $\mathcal{F}_k$ can be computed by finding the largest $k$ elements in these arrays. The
details are given in Lemma 6.

Lemma 6. The set $\mathcal{F}_k$ can be found in $O(\log n + k)$ time.

Proof. Consider any layer $L_i(\mathcal{F})$. Suppose the array of lines of $L_i(\mathcal{F})$ is $l_1, l_2, \ldots, l_h$ and let $l_j$
be the line $l_i(L)$. The intersections of the lines $l_j, l_{j+1}, \ldots, l_h$ with $L$ are sorted in decreasing
order of their heights, and the intersections of the lines $l_{j-1}, l_{j-2}, \ldots, l_1$ with $L$ are also sorted
in decreasing order of their heights. Once $l_j$ is known, we can implicitly obtain the following
two arrays $A_i^r$ and $A_i^l$: the $t$-th element of $A_i^r$ (resp., $A_i^l$) is the height of the intersection of
$l_{j+t-1}$ (resp., $l_{j-t}$) and $L$. Since these lines are explicitly maintained in the layer $L_i(\mathcal{F})$, given
any index $t$, we can obtain the $t$-th element of $A_i^r$ (resp., $A_i^l$) in $O(1)$ time.

To compute the set $\mathcal{F}_k$, we first find the lines $l_i(L)$ for $i = 1, 2, \ldots, k$, which can be done
in $O(\log n + k)$ time due to Lemma 3. Consequently, we obtain the $2k$ arrays $A_i^r$ and $A_i^l$ for
$1 \le i \le k$, implicitly. In fact, we only need to consider the first $k$ elements of each such array,
and below we let $A_i^r$ and $A_i^l$ denote the arrays consisting of only the first $k$ elements. An
easy observation is that the heights of the intersections of $L$ and the lines of $\mathcal{F}_k$ are exactly
the largest $k$ elements of $A = \bigcup_{i=1}^{k} \{A_i^r \cup A_i^l\}$.

In light of the above discussion, to compute $\mathcal{F}_k$, we do the following: (1) find the $k$-th
largest element $r$ of $A$; (2) find the lines whose intersections with $L$ have heights at
least $r$, which can be done in $O(k)$ time by checking the above $2k$ sorted arrays with $r$ in
their index orders. Below, we show that we can compute $r$ in $O(k)$ time.

Recall that $A$ contains $2k$ sorted arrays and each array has $k$ elements. Further, for any
array, given any index $t$, we can obtain its $t$-th element in constant time. Hence, we can find
the $k$-th largest element of $A$ in $O(k)$ time by using the selection algorithm given in [23] for
matrices with sorted columns (each sorted array in our problem can be viewed as a sorted
column of a $k \times 2k$ matrix).

The lemma thus follows. □

Hence, we obtain the following result. 

Theorem 1. For the uniform case, we can build in $O(n \log n)$ time an $O(n)$ size data
structure on $P$ that can answer each top-$k$ query with an unbounded query interval in $O(k + \log n)$
time.

For the threshold query, we are given $I$ and a threshold $r$. We again build the half-plane
range reporting data structure on $\mathcal{F}$. To answer the query, as discussed in Section 2, we
only need to find all lines of $\mathcal{F}$ whose intersections with $L$ have $y$-coordinates larger than or
equal to $r$. We first determine the line $l_1(L)$ by doing binary search on the upper envelope of
$L_1(\mathcal{F})$. Then, by Lemma 3, we find all lines $l_2(L), l_3(L), \ldots, l_j(L)$ whose intersections have
$y$-coordinates larger than or equal to $r$. For each $i$ with $1 \le i \le j$, we walk on the upper
envelope of $L_i(\mathcal{F})$, starting from $l_i(L)$, in both directions in time linear in the output size
to find the lines whose intersections have $y$-coordinates larger than or equal to $r$. Hence, the
running time for answering the query is $O(\log n + m)$.

3.2 Queries with Bounded Intervals 

Now we assume $I = [x_l, x_r]$ is bounded. Consider any point $p \in P$. Recall that $p$ is associated
with an interval $[x_l(p), x_r(p)]$ in the uniform case. Depending on the positions of $I = [x_l, x_r]$
and $[x_l(p), x_r(p)]$, we classify $[x_l(p), x_r(p)]$ and the point $p$ into the following three types with
respect to $I$.

L-type: $[x_l(p), x_r(p)]$ and $p$ are L-type if $x_l \le x_l(p)$.

R-type: $[x_l(p), x_r(p)]$ and $p$ are R-type if $x_r \ge x_r(p)$.

M-type: $[x_l(p), x_r(p)]$ and $p$ are M-type if $I \subset (x_l(p), x_r(p))$.

Denote by $P_L$, $P_R$, and $P_M$ the sets of all L-type, R-type, and M-type points of $P$,
respectively. In the following, for each kind of query, we will build a data structure such
that the different types of points will be searched separately (note that we will not explicitly
compute the three subsets $P_L$, $P_R$, and $P_M$). For each point $p \in P$, we refer to $x_l(p)$ as
the left endpoint of the interval $[x_l(p), x_r(p)]$ and refer to $x_r(p)$ as the right endpoint. For
simplicity of discussion, we assume that no two interval endpoints of the points of $P$ have
the same value.
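
The three types can be checked directly from the endpoints, as in the following Python sketch. The function names `classify` and `prob` are hypothetical, and the non-strict comparisons for the L-type and R-type are an assumption chosen so that, together with the strict containment of the M-type, every point has at least one type. The sketch also computes $\Pr[p \in I]$ for a uniform point as the overlap length divided by the interval length, which makes it easy to confirm that an L-type point has the same probability under $I$ and under $(-\infty, x_r]$.

```python
def classify(xl_p, xr_p, x_l, x_r):
    """Types of the uniform point on [xl_p, xr_p] with respect to I = [x_l, x_r]."""
    types = set()
    if x_l <= xl_p:
        types.add("L")        # I starts at or before the left endpoint of p
    if x_r >= xr_p:
        types.add("R")        # I ends at or after the right endpoint of p
    if xl_p < x_l and x_r < xr_p:
        types.add("M")        # I lies strictly inside the interval of p
    return types

def prob(xl_p, xr_p, x_l, x_r):
    """Pr[p in I] for a uniform point: overlap length over interval length."""
    overlap = min(x_r, xr_p) - max(x_l, xl_p)
    return max(0.0, overlap) / (xr_p - xl_p)
```

A point whose interval lies entirely inside $I$ is both L-type and R-type under these definitions.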

3.2.1 Top-1 Queries 

For any point $p \in P$, denote by $\mathcal{F}_r(p)$ the set of the cdfs of the points of $P$ whose intervals
have left endpoints larger than or equal to $x_l(p)$. Again, as discussed in Section 2, we transform
each cdf of $\mathcal{F}_r(p)$ to a line. We aim to maintain the upper envelope of $\mathcal{F}_r(p)$ for each $p \in P$.
If we computed the $n$ upper envelopes explicitly, we would have a data structure of size
$\Omega(n^2)$. To reduce the space, we choose to use the persistent data structure [21] to maintain
them implicitly such that the data structure size is $O(n)$. The details are given below.

We sort the points of $P$ by the left endpoints of their intervals from left to right, and let
the sorted list be $p'_1, p'_2, \ldots, p'_n$. For each $i$ with $2 \le i \le n$, observe that the set $\mathcal{F}_r(p'_{i-1})$ has
exactly one more line than $\mathcal{F}_r(p'_i)$. If we maintain the upper envelope of $\mathcal{F}_r(p'_i)$ by a balanced
binary search tree (e.g., a red-black tree), then by updating it we can obtain the upper
envelope of $\mathcal{F}_r(p'_{i-1})$ by an insertion and a number of deletions on the tree, and each tree
operation takes $O(\log n)$ time. An easy observation is that there are $O(n)$ tree operations
in total to compute the upper envelopes of all sets $\mathcal{F}_r(p'_1), \mathcal{F}_r(p'_2), \ldots, \mathcal{F}_r(p'_n)$. Further, by
making the red-black tree persistent [21], we can maintain all upper envelopes in $O(n \log n)$
time and $O(n)$ space. We use $\mathcal{L}$ to denote the above data structure.

We can use $\mathcal{L}$ to find the point of $P_L$ with the largest $I$-probability in $O(\log n)$ time,
as follows. First, we find the point $p'_i$ such that $x_l(p'_{i-1}) < x_l \le x_l(p'_i)$. It is easy to see
that $\mathcal{F}_r(p'_i)$ is the set of the cdfs of the points of $P_L$. Consider the unbounded interval
$I' = (-\infty, x_r]$. Consider any point $p$ whose cdf is in $\mathcal{F}_r(p'_i)$. Due to $x_l(p) \ge x_l$, we obtain that
$\Pr[p \in I] = \Pr[p \in I']$. Hence, the point $p$ of $\mathcal{F}_r(p'_i)$ with the largest value $\Pr[p \in I]$ also
has the largest value $\Pr[p \in I']$.
This implies that we can instead use the unbounded interval $I'$ as the query interval on the
upper envelope of $\mathcal{F}_r(p'_i)$, in the same way as in Section 3.1. The persistent data structure
$\mathcal{L}$ maintains the upper envelope of $\mathcal{F}_r(p'_i)$ such that we can find in $O(\log n)$ time the point
$p$ of $\mathcal{F}_r(p'_i)$ with the largest value $\Pr[p \in I']$.

Similarly, we can build a data structure $\mathcal{R}$ of $O(n)$ space in $O(n \log n)$ time that can find
the point of $P_R$ with the largest $I$-probability in $O(\log n)$ time.


Fig. 6. Dragging a segment of slope 1 out of the corner at $q_I$: $q^*$ is the first point that will be hit by the segment.


To find the point of $P_M$ with the largest $I$-probability, the approach for $P_L$ and $P_R$ does
not work because we cannot reduce the query to another query with an unbounded interval.
Instead, we reduce the problem to a “segment dragging query” by dragging a line segment
out of a corner in the plane, as follows.

For each point $p$ of $P$, we define a point $q = (x_l(p), x_r(p))$ in the plane, and we say that
$p$ corresponds to $q$. A similar transformation was also used in [H]. Let $Q$ be the set of the $n$
points defined by the points of $P$. For the query interval $I = [x_l, x_r]$, we also define a point
$q_I = (x_l, x_r)$ (this is different from [2], where $I$ defines a point $(x_r, x_l)$). If we partition the
plane into four quadrants with respect to $q_I$, then we have the following lemma.

Lemma 7. The points of $P_M$ correspond to the points of $Q$ that strictly lie in the second
quadrant (i.e., the northwest quadrant) of $q_I$.

Proof. Consider any point $p \in P$. Let $q = (x_l(p), x_r(p))$ be the point defined by $p$. On the
one hand, $p$ is in $P_M$ if and only if $I \subset (x_l(p), x_r(p))$, i.e., $x_l > x_l(p)$ and $x_r < x_r(p)$. On the
other hand, $x_l > x_l(p)$ and $x_r < x_r(p)$ if and only if $q$ is in the second quadrant of $q_I = (x_l, x_r)$.
The lemma thus follows. □

Let $\rho_u$ be the upwards ray originating from $q_I$ and let $\rho_l$ be the leftwards ray originating
from $q_I$. Imagine that starting from the point $q_I$ and moving towards the northwest, we drag a segment
of slope 1 with two endpoints on $\rho_u$ and $\rho_l$, respectively, and let $q^*$ be the point of $Q$ hit first
by the segment (e.g., see Fig. 6).

Lemma 8. The point of P that defines q* is in P_M and has the largest I-probability among 
all points in P_M. 

Proof. First of all, by Lemma 7, the point of P that defines q* must be in P_M. 

Consider any point q in the second quadrant of q_I, and let p be the point of P that defines 
q. Since the interval of p contains the interval I, we have Pr[p ∈ I] = (x_r − x_l) / (x_r(p) − x_l(p)). 

Based on the definition of q*, q* is the point q of Q in the second quadrant of q_I that has 
the smallest value x_r(p) − x_l(p). Therefore, q* is the point q of Q in the second quadrant of 
q_I that has the largest value Pr[p ∈ I]. The lemma thus follows. □ 
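
The reduction in Lemmas 7 and 8 can be checked with a brute-force sketch. The function name and the representation of each uncertain point as a pair (x_l(p), x_r(p)) are assumptions made for illustration only; the linear scan below stands in for the O(log n) segment-dragging structure and is not that structure.

```python
def drag_segment_query(points, x_l, x_r):
    """Brute-force stand-in for the out-of-corner segment-dragging query.

    points: list of pairs (a, b) = (x_l(p), x_r(p)), one per uncertain point.
    Returns the pair hit first when a slope-1 segment is dragged northwest
    out of the corner q_I = (x_l, x_r), or None if no point qualifies.
    """
    best = None
    for (a, b) in points:
        # Lemma 7: p is in P_M iff its point lies strictly northwest of q_I.
        if a < x_l and b > x_r:
            # A slope-1 segment at offset t lies on the line y - x = (x_r - x_l) + t,
            # so the first point hit is the one minimizing b - a.
            if best is None or (b - a) < (best[1] - best[0]):
                best = (a, b)
    return best
```

For example, with intervals (0, 10), (2, 6), (3, 5) and the query I = [3.5, 4], the call returns (3, 5), the shortest interval containing I and hence the one with the largest I-probability.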

Based on Lemma 8, to determine the point of P_M with the largest I-probability, we 
only need to solve the above query on Q by dragging a segment out of a corner. More 
specifically, we need to build a data structure on Q to answer the following out-of-corner 
segment-dragging queries: Given a point q, find the first point of Q hit by dragging a segment 
of slope 1 from q towards the northwest with its two endpoints on the two rays 
ρ_u(q) and ρ_l(q), respectively, where ρ_u(q) is the upwards ray originating from q and ρ_l(q) is 
the leftwards ray originating from q. By using Mitchell’s result in [33] (reducing the problem 
to a point location problem), we can build an O(n) size data structure on Q in O(n log n) 
time that can answer each such query in O(log n) time. 

Theorem 2. For the uniform case, we can build in O(n log n) time an O(n) size data 
structure on P that can answer each top-1 query in O(log n) time. □ 

3.2.2 Top-k Queries 

To answer a top-k query, we do the following. First, we find the top-k points in P_L (i.e., 
the k points of P_L whose I-probabilities are the largest), the top-k points in P_R, and the 
top-k points in P_M. Then, we find the top-k points of P from the above 3k points. Below we 
build three data structures for computing the top-k points in P_L, P_R, and P_M, respectively. 

We first build the data structure for P_L. Again, let p'_1, p'_2, ..., p'_n be the list of the points of 
P sorted by the left endpoints of their intervals from left to right. We construct a complete 
binary search tree T_L whose leaves from left to right store the n intervals of the points 
p'_1, p'_2, ..., p'_n. For each internal node v, let P_v denote the set of points whose intervals are 
stored in the leaves of the subtree rooted at v. We build the half-plane range reporting data 
structure discussed in Section 2 on P_v, denoted by D_v. Since the size of D_v is O(|P_v|), the total 
size of the data structure T_L is O(n log n), and T_L can be built in O(n log^2 n) time. 

We use T_L to compute the top-k points in P_L as follows. By the standard approach and 
using x_l, we find in O(log n) time a set V of O(log n) nodes of T_L such that P_L = ∪_{v∈V} P_v 
and no node of V is an ancestor of another node. Then, we can determine the top-k points 
of P_L similarly as in Section 3.1. However, since we now have O(log n) data structures 
D_v, we need to maintain the candidate sets for all such D_v's. Specifically, after we find the 
top-1 point in D_v for each v ∈ V, we use a heap H to maintain them, where the “keys” are 
the I-probabilities of the points. Let p be the point of H with the largest key. Clearly, p is 
the top-1 point of P_L; assume p is from D_v for some v ∈ V. We remove p from H and insert 
at most three new points from D_v into H, in a similar way as in Section 3.1. We repeat the 
same procedure until we find all top-k points of P_L. 
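
The heap-driven loop above can be illustrated with a simplified sketch. As an assumption made for illustration, each D_v is modeled as a list of I-probabilities already sorted in decreasing order, so each extraction is followed by one insertion from the same list rather than the at most three candidates produced by the layered envelopes. The function name is hypothetical.

```python
import heapq

def top_k_from_sorted_streams(streams, k):
    """Merge decreasing-sorted lists (stand-ins for the D_v's) and return the
    k largest keys in decreasing order, using a heap of O(|V| + k) entries."""
    # heapq is a min-heap, so keys are negated to simulate a max-heap.
    heap = [(-s[0], i, 0) for i, s in enumerate(streams) if s]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        neg_key, i, j = heapq.heappop(heap)      # current largest key
        result.append(-neg_key)
        if j + 1 < len(streams[i]):              # next candidate from the same D_v
            heapq.heappush(heap, (-streams[i][j + 1], i, j + 1))
    return result
```

Each of the O(k) heap operations costs O(log(k + log n)) when there are O(log n) lists, which matches the bound analyzed in the text.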

To analyze the running time, for each node v ∈ V, we can determine in O(log n) time the 
line in the first layer of D_v whose intersection with L is on the upper envelope of the first 
layer, and each subsequent operation on D_v takes O(1) time due to fractional cascading. 
Hence, the total time for this step in the entire algorithm is O(log^2 n). However, we can do 
better by building a fractional cascading structure [9] on the first layers of D_v for all nodes v 
of the tree T_L. In this way, the above step takes only O(log n) time in the entire algorithm, 
i.e., we do binary search only at the root of T_L. In addition, building the heap H initially takes 
O(log n) time. Note that the additional fractional cascading structure on T_L does not change 
the size and construction time of T_L asymptotically [9]. The entire query algorithm has O(k) 
operations on H in total and the size of H is O(log n + k). Hence, the total time for finding 
the top-k points of P_L is O(log n + k log(k + log n)), which is O(log n + k log k) by Lemma 9. 

Lemma 9. log n + k log(k + log n) = O(log n + k log k). 

Proof. To simplify the notation, let n' = log n; our goal is to prove k log(k + n') + n' = 
O(k log k + n'). Depending on whether k ≥ n'/log n', there are two cases. 

1. If k ≥ n'/log n', then log k ≥ log n' − log log n', implying that log log n' = O(log k). Thus, 

   k log(k + n') ≤ k log(k + k log n') = O(k log(k log n')) 
                 = O(k (log k + log log n')) = O(k log k). 

   Hence, we obtain that k log(k + n') + n' = O(k log k + n'). 

2. If k < n'/log n', then 

   k log(k + n') < (n'/log n') log(n'/log n' + n') = O((n'/log n') log n') = O(n'). 

   Hence, we obtain that k log(k + n') + n' = O(k log k + n'). 

The lemma thus follows. □ 


If k = Ω(log n log log n), we have a better result in Lemma 10. Note that, compared 
with Lemma 6, we need to use other techniques to obtain Lemma 10 since the problem here 
involves O(log n) half-plane range reporting data structures D_v, while Lemma 6 only needs 
to deal with one such data structure. 


Lemma 10. If k = Ω(log n log log n), we can compute the top-k points in P_L in O(k) time. 

Proof. We assume k = Ω(log n log log n). Recall that L is the vertical line with x-coordinate 
x_r. Let V be the set of O(log n) nodes of T_L as defined before. Consider any node v ∈ V, 
which is associated with a half-plane range reporting data structure D_v on the cdfs of the 
point set P_v. Let F(v) be the set of cdfs of the points of P_v. Let F(V) = ∪_{v∈V} F(v). Our goal is to 
find the k lines of F(V) whose intersections with L are the highest; we denote by F_k the 
set of these k lines. 

We let L_1(v), L_2(v), ... be the layers of F(v), and for each layer L_i(v), denote by l_i(v) 
the line of L_i(v) whose intersection with L is on the upper envelope of the layer L_i(v). For 
each layer L_i(v), we define two arrays A^l_i(v) and A^r_i(v) in the same way as in the proof of 
Lemma 6 (we omit the details). For each node v, we define another array B(v) of size k as 
follows: for each 1 ≤ i ≤ k, the i-th element of B(v) is the height of the intersection of l_i(v) 
and L. Hence, the elements of B(v) are sorted in decreasing order. 

Our algorithm for computing F_k has two main steps. In the first main step, we find a 
set B' of the largest k elements in B(V) = ∪_{v∈V} B(v). For each v ∈ V, let j_v be the number 
of elements of B(v) that are contained in B', i.e., the first j_v elements of B(v) are in B'. An 
easy observation is that the heights of the intersections of L and the sought lines of F_k are 
the largest k elements of A = ∪_{v∈V} ∪_{i=1}^{j_v} {A^l_i(v) ∪ A^r_i(v)}, which contains 2k sorted arrays. 
In the second main step, we find the largest k elements of A and thus obtain the set F_k. 
Below, we show that the above two main steps can be done in O(k) time. 

We consider the first main step. For simplicity of discussion, we assume no two elements 
of B(V) are equal. First of all, as discussed above, in O(log n) time we can determine the 
first-layer lines l_1(v) for all v ∈ V, and thus the first elements of all arrays B(v) for v ∈ V are determined. 
Further, for each array B(v), due to the fractional cascading, we can obtain each next element 
in constant time; in other words, we can obtain the first i elements of B(v) in O(i) 
time. To compute the set B', i.e., the largest k elements of B(V), since B(V) contains |V| 
sorted arrays, one may want to use the selection algorithm in [23] again, as in Lemma 6. 
However, here we cannot use that algorithm because, for an array in B(V), we cannot 
obtain an arbitrary element in constant time. Instead, we propose 
the following approach. 

Recall that k = Ω(log n log log n). For simplicity of discussion, we assume |V| = log n 
and k ≥ log n log log n. Let h = log log n. Let H be a max-heap. Initially H = ∅. During 
the algorithm, we maintain a set S of elements. Initially S = ∅, and after the algorithm 
stops, S contains at most k + h elements. 

First of all, for each array B(v), we compute its first h elements and then insert only the 
h-th element of B(v) into H. Now H contains |V| = log n elements. We do an “extract-max” 
operation on H, i.e., remove the largest element from H. Suppose the element removed 
is from B(v) for a node v. Then, we add the first h elements of B(v) to S. If |S| ≥ k, 
the algorithm stops; otherwise, we compute the next h elements of B(v) and insert only the 
(2h)-th element of B(v) into H. 

In general, suppose we do an “extract-max” operation on H and let the removed element 
be the (i·h)-th element of the array B(v) for a node v. Then, we add the 
elements of B(v) with indices from (i − 1)·h + 1 to i·h to S. If |S| ≥ k, the algorithm stops; 
otherwise, we compute the next h elements of B(v) and insert the ((i + 1)·h)-th element of 
B(v) into H. 

After the algorithm stops, we do the following. Consider any element in the current heap 
H, and suppose it is the (j·h)-th element of the array B(v) for a node v. Then, according 
to our algorithm, the elements of B(v) with indices from (j − 1)·h + 1 to j·h are not in S, and 
we call these h elements the red elements. Since the current H has at most |V| − 1 elements, 
there are at most h·(|V| − 1) red elements, and we let S' be the union of S and all red 
elements. 

We claim that the largest k elements of B(V) must be in S', i.e., B' ⊆ S'. We prove 
the claim as follows. Let a be the element that is removed from H in the last extract-max 
operation on H in the above algorithm, i.e., after a is removed, the algorithm stops. Let H 
be the heap after the algorithm stops. According to our algorithm, since each array B(v) is 
sorted decreasingly, a is the smallest element in S. Since |S| ≥ k and |B'| = k, no element 
of B' is smaller than a. If B' ⊆ S, then the claim is proved. Otherwise, suppose an 
element b is in B' \ S. Since b > a and all elements of H are smaller than a, b must be a red 
element, and thus b is in S'. The claim is proved. 

In light of the above claim, B' can be easily obtained after we find the k-th largest element 
of S'. 
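
The first main step can be sketched as follows, under stated assumptions: each B(v) is given as an explicit list sorted in decreasing order (in the paper its elements are produced lazily in constant time each), the function name is hypothetical, and the final selection uses sorting for brevity where the paper uses linear-time selection.

```python
import heapq

def top_k_blocked(arrays, k, h):
    """Return the k largest elements of the union of decreasing-sorted arrays.

    Only the last element of each block of h elements enters the heap, so the
    heap sees about k/h + |V| insertions instead of k.
    """
    heap = []  # entries: (-block minimum, array index, block start, block end)

    def push_block(a, start):
        end = min(start + h, len(arrays[a]))
        if start < end:
            heapq.heappush(heap, (-arrays[a][end - 1], a, start, end))

    for a in range(len(arrays)):
        push_block(a, 0)

    selected = []                      # the set S
    while heap and len(selected) < k:
        _, a, start, end = heapq.heappop(heap)    # extract-max
        selected.extend(arrays[a][start:end])     # add the whole block to S
        if len(selected) < k:
            push_block(a, end)                    # next block of the same array

    # Blocks still represented in the heap are the "red" elements; S' = S + red.
    for _, a, start, end in heap:
        selected.extend(arrays[a][start:end])

    # Sorting stands in for linear-time selection of the k-th largest element.
    return sorted(selected, reverse=True)[:k]
```

With arrays [9, 7, 5, 3, 1], [10, 8, 6, 4, 2], [11], k = 4 and h = 2, the blocks [11], [10, 8], and [9, 7] are extracted, the block [6, 4] is red, and the result is [11, 10, 9, 8].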

In the sequel, we analyze the running time of the above algorithm for the first main 
step. Recall that h = log log n. First of all, the size of the heap H is at most |V| = log n 
at any time during the algorithm. Clearly, the algorithm stops after at most ⌈k/h⌉ extract-max 
operations, since each of them adds h elements to S. The number of insertions is at most 
⌈k/h⌉ + |V|. Therefore, the running time 
of all operations on H in the entire algorithm is O((k/h + log n) log log n), which is O(k) 
due to k = Ω(log n log log n). On the other hand, the number of red elements is at most 
h·(log n − 1), and thus |S'| ≤ h·(log n − 1) + |S| ≤ h·(log n − 1) + k + h, which is O(k) 
due to k = Ω(log n log log n). Notice that the elements of B(V) that have been computed 
during the entire algorithm are exactly those in S', and thus the time for computing these 
elements is O(|S'|) = O(k). Finally, since |S'| = O(k), we can find the k-th largest element 
in S' in O(k) time by using the well-known linear time selection algorithm. 

In summary, the first main step can find the set B' in O(k) time. 

The second main step is to compute the largest k elements in A, which contains 2k arrays. 
As in the proof of Lemma 6, we can obtain any arbitrary element of these arrays in constant 
time, without computing these arrays explicitly. Hence, by using the same approach as in 
Lemma 6, we can compute the largest k elements of A in O(k) time. Consequently, the set 
F_k can be obtained. 

The lemma thus follows. □ 

To compute the top-k points of P_R, we build a similar data structure T_R, in a symmetric 
way as T_L, and we omit the details. 

Finally, to compute the top-k points in P_M, we do the following transformation. For each 
point p ∈ P, we define a point q = (x_l(p), x_r(p), 1/(x_r(p) − x_l(p))) in the 3-D space with x-, y-, 
and z-axes. Let Q be the set of all points in the 3-D space thus defined. Let the query interval 
I define an unbounded query box (or 3-D rectangle) B_I = (−∞, x_l) × (x_r, +∞) × (−∞, +∞). 
Similar to Lemma 7, the points of P_M correspond exactly to the points of 
Q ∩ B_I. Further, the top-k points of P_M correspond to the k points of Q ∩ B_I whose 
z-coordinates are the largest. Denote by Q_I the k points of Q ∩ B_I whose z-coordinates are 
the largest. Below we build a data structure on Q for computing the set Q_I for any query 
interval I and thus finding the top-k points of P_M. 
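
The transformation can be illustrated by a brute-force sketch that lifts each interval to a 3-D point and filters by the query box. The function name and the representation of uncertain points as pairs (x_l(p), x_r(p)) are assumptions for illustration; the linear scan stands in for the tree T_M described next.

```python
import heapq

def top_k_middle(intervals, x_l, x_r, k):
    """Return the k intervals strictly containing [x_l, x_r] with the largest
    I-probabilities, as pairs ((a, b), probability), in decreasing order."""
    # Lift p to q = (x_l(p), x_r(p), 1 / (x_r(p) - x_l(p))).
    lifted = [(a, b, 1.0 / (b - a)) for (a, b) in intervals]
    # Keep the points inside B_I = (-inf, x_l) x (x_r, +inf) x (-inf, +inf).
    in_box = [q for q in lifted if q[0] < x_l and q[1] > x_r]
    # The top-k points of P_M are the k points of the box with the largest z.
    best = heapq.nlargest(k, in_box, key=lambda q: q[2])
    # For the uniform case, Pr[p in I] = (x_r - x_l) * z.
    return [((a, b), (x_r - x_l) * z) for (a, b, z) in best]
```

Because every point in the box shares the same factor x_r − x_l, ranking by z-coordinate is equivalent to ranking by I-probability.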

We build a complete binary search tree T_M whose leaves from left to right store all points 
of Q ordered by increasing x-coordinate. For each internal node v of T_M, we build an 
auxiliary data structure D_v as follows. Let Q_v be the set of the points of Q stored in the 
leaves of the subtree of T_M rooted at v. Suppose all points of Q_v have x-coordinates less than 
x_l. Let Q'_v be the set of points of Q_v whose y-coordinates are larger than x_r. The purpose of the 
auxiliary data structure D_v is to report the points of Q'_v in decreasing z-coordinate order 
in constant time each after the point q_v is found, where q_v is the point of Q'_v with the 
largest z-coordinate. To achieve this goal, we use the data structure given by Chazelle and 
Guibas [10] (the one for Subproblem P1 in Section 5); the data structure is a hive graph 
[8], which can be viewed as a preliminary version of the fractional cascading technique 
[9]. By using the result in [10], we can build such a data structure D_v of size O(|Q_v|) in 
O(|Q_v| log |Q_v|) time that can first compute q_v in O(log |Q_v|) time and then report the other 
points of Q'_v in decreasing z-coordinate order in constant time each. Since the size of D_v 
is O(|Q_v|), the size of the tree T_M is O(n log n), and T_M can be built in O(n log^2 n) time. 

Using T_M, we find the set Q_I as follows. We first determine the set V of O(log n) nodes 
of T_M such that ∪_{v∈V} Q_v consists of all points of Q whose x-coordinates are less than x_l and no 
node of V is an ancestor of another node of V. Then, for each node v ∈ V, by using D_v, 
we find q_v, i.e., the point of Q'_v with the largest z-coordinate, and insert q_v into a heap H, 
where the key of each point is its z-coordinate. We find the point in H with the largest key 
and remove it from H; denote this point by q'_1. Clearly, q'_1 is the point of Q_I with the 
largest z-coordinate. Suppose q'_1 is from a node v ∈ V. We proceed on D_v to find the point 
of Q'_v with the second largest z-coordinate and insert it into H. Now the point of H with 
the largest key is the point of Q_I with the second largest z-coordinate. We repeat the above 
procedure until we find all k points of Q_I. 

To analyze the query time, finding the set V takes O(log n) time. For each node v ∈ V, 
the search for q_v on D_v takes O(log n) time plus time linear in the number of points 
of D_v in Q_I. Hence, the total time for searching q_v for all nodes v ∈ V is O(log^2 n). 
Similarly as before, we can remove a logarithmic factor by building a fractional cascading 
structure on the nodes of T_M for searching such points q_v, in exactly the same way as in 
[8]. With the help of the fractional cascading structure, all these q_v’s for v ∈ V can be found 
in O(log n) time. Note that building the fractional cascading structure does not change the 
construction time and the size of T_M asymptotically [8]. In addition, building the heap H 
initially takes O(log n) time. In the entire algorithm there are O(k) operations on H in total 
and the size of H is always bounded by O(k + log n). Therefore, the running time of the 
query algorithm is O(log n + k log(k + log n)), which is O(log n + k log k) by Lemma 9. 
Using similar techniques as in Lemma 10, we obtain the following result. 

Lemma 11. If k = Ω(log n log log n), we can compute the top-k points in P_M in O(k) time. 

Proof. Consider any node v in V, which is associated with a set Q_v and a data structure 
D_v. Define an array B(v) of size k as follows: for each 1 ≤ i ≤ k, the i-th element of B(v) 
is the i-th largest z-coordinate of the points of Q'_v. As discussed above, in O(log n) time we 
can obtain the first elements of B(v) for all v ∈ V, and after that, we can obtain each next 
element of each array B(v) in constant time by using the data structure D_v. Our goal 
is to find the point set Q_I. 

Let B(V) = ∪_{v∈V} B(v). An easy observation is that the z-coordinates of the points of 
Q_I are exactly the largest k elements in B(V). Since k = Ω(log n log log n), computing the 
largest k elements of B(V) can be done in O(k) time in the same way as the first main step 
of the algorithm in the proof of Lemma 10, and we omit the details. 

The lemma is thus proved. □ 

We summarize our results for the top-k queries below. 

Theorem 3. For the uniform case, we can build in O(n log^2 n) time an O(n log n) size data 
structure on P that can answer each top-k query in O(k) time if k = Ω(log n log log n) and 
O(k log k + log n) time otherwise. □ 

3.2.3 Threshold Queries 

To answer the threshold queries, we build the same data structure as in Theorem 3, i.e., 
the three trees T_L, T_M, and T_R. The tree T_L is used for finding the points p of P_L with 
Pr[p ∈ I] ≥ r; T_R is for finding the points p of P_R with Pr[p ∈ I] ≥ r; T_M is for finding the 
points p of P_M with Pr[p ∈ I] ≥ r. The three trees T_L, T_R, and T_M are exactly the same as 
those for Theorem 3. We can compute them in O(n log^2 n) time and O(n log n) space. 

Below, we discuss the query algorithms on the three trees. Let m_L, m_R, and m_M be the 
numbers of points in P_L, P_R, and P_M whose I-probabilities are at least r, respectively. Hence, 
m = m_L + m_R + m_M. 

To find the points p of P_L with Pr[p ∈ I] ≥ r, we first determine the set V of O(log n) 
nodes of T_L such that ∪_{v∈V} P_v = P_L and no node of V is an ancestor of another node of V. 
Recall that each node v of T_L is associated with a half-plane range reporting data structure 
D_v. For each node v ∈ V, by using D_v, we can find the points p of P_v with Pr[p ∈ I] ≥ r in 
O(log n + m_v) time, where m_v is the output size. Note that P_v and P_u are disjoint for any 
two nodes v and u of V. Hence, Σ_{v∈V} m_v = m_L. As there are O(log n) nodes in V, it takes 
O(log^2 n + m_L) time to find all points p of P_L with Pr[p ∈ I] ≥ r, and again the O(log^2 n) 
term can be reduced to O(log n) by using fractional cascading [9]. Hence, the total 
query time is O(log n + m_L). 

We can use a similar approach to find all points p of P_R with Pr[p ∈ I] ≥ r in 
O(log n + m_R) time, by using T_R. We omit the details. 

Finally, we find the points p of P_M with Pr[p ∈ I] ≥ r by using T_M. As in Section 3.2.2, 
for each point p ∈ P, we define in the 3-D space a point (x_l(p), x_r(p), 1/(x_r(p) − x_l(p))). Let 
Q be the set of all n points defined above. Since Pr[p ∈ I] = (x_r − x_l)/(x_r(p) − x_l(p)) for 
each p ∈ P_M, we have Pr[p ∈ I] ≥ r if and only if the z-coordinate of the point defined by p 
is at least r/(x_r − x_l). Let the interval I = [x_l, x_r] and r together define 
an unbounded 3-D box query B_I = (−∞, x_l) × (x_r, +∞) × [r/(x_r − x_l), +∞). Let Q_I = Q ∩ B_I. Hence, 
the points p of P_M with Pr[p ∈ I] ≥ r correspond to the points of Q_I, and thus m_M = |Q_I|. 

By using the tree T_M, we can find Q_I in O(log n + m_M) time, as follows. We first determine 
the set V of O(log n) nodes of T_M such that ∪_{v∈V} Q_v consists of all points of Q whose 
x-coordinates are less than x_l and no node of V is an ancestor of another node of V. Consider 
any node v ∈ V. Let Q'_v be the set of points of Q_v whose y-coordinates are larger than x_r, and let q_v 
be the point of Q'_v with the largest z-coordinate. Recall that after q_v is found, D_v can report 
the other points of Q'_v in decreasing z-coordinate order in constant time each. Hence, after 
q_v is known, we can report the points of Q_v in the query box B_I in time linear in the output 
size. Again, with the help of fractional cascading, the points q_v for all v ∈ V can be found in 
O(log n) time. Therefore, we can find all points of Q_I in O(log n + m_M) time. In other words, 
with T_M, we can find the points of P_M whose I-probabilities are at least r in O(log n + m_M) 
time. 
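
The output-sensitive reporting idea can be sketched as follows. This is a simplified illustration under stated assumptions: uncertain points are pairs (x_l(p), x_r(p)) with uniform densities, the candidates are sorted on the fly rather than read from a prebuilt D_v, x_r > x_l, and the z-threshold r/(x_r − x_l) is derived here from Pr[p ∈ I] = (x_r − x_l)·z rather than quoted from the text. The function name is hypothetical.

```python
def report_threshold_middle(intervals, x_l, x_r, r):
    """Report the intervals strictly containing [x_l, x_r] whose I-probability
    is at least r, scanning candidates in decreasing z = 1 / (b - a) order."""
    # Candidates inside the x/y range of the query box; sorting by increasing
    # length is the same as sorting by decreasing z-coordinate.
    candidates = sorted(
        ((a, b) for (a, b) in intervals if a < x_l and b > x_r),
        key=lambda iv: iv[1] - iv[0],
    )
    z_min = r / (x_r - x_l)    # Pr[p in I] >= r  iff  z >= r / (x_r - x_l)
    out = []
    for (a, b) in candidates:
        if 1.0 / (b - a) < z_min:
            break              # all remaining candidates have smaller z
        out.append((a, b))
    return out
```

Once the first candidate below the threshold is met, the scan stops, so the work after locating the first candidate is proportional to the number of reported points plus one.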

Hence, we obtain Theorem 4. 


Theorem 4. For the uniform case, we can build in O(n log^2 n) time an O(n log n) size data 
structure that can answer each threshold query in O(m + log n) time, where m is the output 
size of the query. 

4 The Histogram Distribution 


In this section, we present our data structures for the histogram case. In the histogram case, 
the cdf of each point p ∈ P has c pieces; recall that we assumed c is a constant, and thus F 
is still a set of O(n) line segments. 

We first discuss our data structures for the unbounded case in Section 4.1 and then 
present our results for the bounded case in Section 4.2. 

4.1 Queries with Unbounded Intervals 

Again, we assume w.l.o.g. that x_l = −∞. Recall that L is the vertical line with x-coordinate 
x_r. Note that Lemmas 1 and 2 are still applicable. 

4.1.1 Top-1 Queries 

For the top-1 queries, as in Section 3.1, it is sufficient to maintain the upper envelope of F. 
Although F now is a set of line segments, its upper envelope is still of size O(n) and can 
be computed in O(n log n) time [5]. Given the query interval I, we can compute in O(log n) 
time the cdf of F whose intersection with L is on the upper envelope of F. 
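
As a reference point for what the upper envelope answers, the following brute-force sketch evaluates each piecewise-linear cdf at x_r and returns the largest. The representation of a cdf by breakpoints and cdf values, the dictionary of named points, and both function names are assumptions for illustration; the sketch takes linear time per query and does not build the envelope.

```python
from bisect import bisect_right

def cdf_value(breaks, values, x):
    """Evaluate a piecewise-linear cdf with increasing breakpoints `breaks` and
    cdf values `values` at those breakpoints (values[0] = 0, values[-1] = 1)."""
    if x <= breaks[0]:
        return 0.0
    if x >= breaks[-1]:
        return 1.0
    i = bisect_right(breaks, x) - 1                   # piece containing x
    t = (x - breaks[i]) / (breaks[i + 1] - breaks[i])
    return values[i] + t * (values[i + 1] - values[i])

def top1_unbounded(points, x_r):
    """points: dict mapping a name to (breaks, values). Returns the name whose
    cdf is highest on the vertical line x = x_r, i.e., the top-1 point for the
    query interval (-inf, x_r]."""
    return max(points, key=lambda name: cdf_value(*points[name], x_r))
```

For instance, with p having breakpoints [0, 2, 4] and values [0, 0.5, 1], and q having breakpoints [1, 3] and values [0, 1], the answer is p at x_r = 1.5 and q at x_r = 2.5.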

Theorem 5. In the histogram case, we can build in O(n log n) time an O(n) size data 
structure on P that can answer each top-1 query with an unbounded query interval in O(log n) 
time. 

4.1.2 Threshold Queries 

For the threshold query, as discussed in Section 2, we only need to find the cdfs of F whose 
intersections with L have y-coordinates at least r. Let q_I be the point (x_r, r) on L. A line 
segment is vertically above q_I if the segment intersects L and the intersection is at least as 
high as q_I. Hence, to answer the threshold query on I, it is sufficient to find the segments of 
F that are vertically above q_I. Agarwal et al. [3] gave the following result on the 
segment-below-point queries. For a set S of O(n) line segments in the plane, a data structure of O(n) 
size can be computed in O(n log n) time that can report the segments of S vertically below 
a query point q in O(m' + log n) time, where m' is the output size. In our problem, we need 
a data structure on F to solve the segments-above-point queries, which can be solved by 
using the same approach as [3]. Therefore, we can build in O(n log n) time an O(n) size data 
structure on P that can answer each threshold query with an unbounded query interval in 
O(m + log n) time. 

Theorem 6. In the histogram case, we can build in O(n log n) time an O(n) size data 
structure on P that can answer each threshold query with an unbounded query interval in 
O(m + log n) time, where m is the output size of the query. 






4.1.3 Top-k Queries 


For the top-k queries, we only need to find the k segments of F whose intersections with 
L are the highest. To this end, we can slightly modify the data structure for the 
segment-below-point queries given in [3]. 

The data structure in [3] is a binary tree structure that maintains a number of sets of 
lines (each such line contains a segment of F). For each such set of lines, a half-plane range 
reporting data structure similar to that in Section 2 is built, where the lower envelopes 
(instead of the upper envelopes as we discussed in Section 2) of the layers of the lines are 
maintained. For our purpose, we replace it by our half-plane range reporting data structure 
in Section 2 (i.e., we maintain the upper envelopes). With this modification, we can answer the 
segments-above-point queries in the following way. 

Consider a query point q = (x_r, −∞) (i.e., the lower infinite endpoint of L), and suppose 
we want to find the segments of F vertically above q, which are also the segments intersecting 
L. By using the data structure of [3] modified as above, the query algorithm works as follows. 
First, with the help of fractional cascading, in O(log n) time, the query algorithm finds 
O(log n) half-plane range reporting data structures such that for each such data structure 
D the segment intersecting L on the upper envelope of D is known. Second, for each such 
half-plane range reporting data structure D, from the above known segment intersecting L, 
by using the fractional cascading and walking on the upper envelopes of the layers of D, we 
can report all lines of D higher than q in constant time each. The first step takes O(log n) 
time, and the second step takes O(m') time, where m' is the total output size. 

For our problem, we only need to report the highest k segments of F that are vertically 
above q. To this end, we modify the query algorithm such that the segments of F 
vertically above q are reported in order from top to bottom, and once k segments are 
reported, we terminate the algorithm. We use a heap H in a similar way as in Section 
3.1 for the uniform case. Specifically, in the first step, we find the O(log n) half-plane range 
reporting data structures, and for each such data structure D, the highest segment of D 
intersecting L is known. In the second step, we build a heap H on these O(log n) segments 
where the keys are the y-coordinates of their intersections with L. The segment in H with the 
largest key must be the highest segment of F intersecting L. We remove the segment from 
H, and let D be the half-plane range reporting data structure that contains the segment. As 
in Section 3.1 for the uniform case, we determine in constant time at most three segments 
from D and insert them into H. Now the segment of H with the largest key is the second 
highest segment of F intersecting L. We repeat the above procedure until we have reported 
k segments. 

To analyze the running time, the first step takes O(log n) time. In the second step, 
we have O(k) operations on H, and the segments that are inserted into H can be found in 
constant time each by the range-reporting data structures. The size of the heap H in the 
entire query algorithm is O(k + log n). Hence, the running time of the query algorithm is 
O(k log(k + log n) + log n), which is O(k log k + log n) by Lemma 9. 

Similarly, if k = Ω(log n log log n), we can answer the top-k query in O(k) time, as follows. 
As discussed above, in O(log n) time we find the O(log n) half-plane range reporting data 
structures, and for each such data structure D, the highest segment of D intersecting L is 
known. The answer to the top-k query is the highest k intersections of L and the lines in 
these O(log n) half-plane range reporting data structures. This is exactly the same situation 
as in Lemma 10, where we also have O(log n) half-plane range reporting data structures. 
Hence, the algorithm in Lemma 10 is applicable here, which runs in O(k) time. 

In summary, we obtain the following results. 

Theorem 7. We can build in O(n log n) time an O(n) size data structure on P that can 
answer each top-k query with an unbounded query interval in O(k) time if k = Ω(log n log log n) 
and O(k log k + log n) time otherwise. 

4.2 Queries with Bounded Intervals 

In this case, the query interval I = [x_l, x_r] is bounded. 

For this case, Agarwal et al. [3] built a data structure of size O(n log^2 n) in O(n log^3 n) 
expected time, which can answer each threshold query in O(log^3 n + m) time. We first briefly 
discuss this data structure (refer to Section 4 of [3] for more details) because our data 
structures for top-1 and top-k queries also use some of their techniques. 

Agarwal et al. [3] built a data structure (a binary search tree), denoted by T, which 
maintains a family of canonical sets of planes in 3D (defined by the uncertain points of P). 
Consider any query interval I = [x_l, x_r] with a threshold value r. Let q(I) be the point 
with coordinates (x_l, x_r, r) in 3D, and let L(I) be the line through q(I) and parallel to the 
z-axis. Using T, one can determine a family F(I) of O(log^2 n) canonical sets of T with the 
following property: Each uncertain point p defines one and only one plane in F(I) such that 
the z-coordinate of the intersection of the plane with L(I) is the probability Pr[p ∈ I]. Note 
that the canonical sets of F(I) are pairwise disjoint. 

To answer the threshold query on I and r, it is sufficient to report the planes in each 
canonical set of F(I) that lie above the point q(I). To this end, for each canonical set S of T, 
Agarwal et al. [3] constructed a halfspace range-reporting data structure given by Afshani 
and Chan [2] on the planes in S in O(|S|) space and O(|S| log |S|) expected time, such that 
given any point q, one can report the planes of S above q in O(log |S| + M) time, where 
M is the output size. In this way, because there are O(log^2 n) canonical sets in F(I), the 
threshold query can be answered in O(log^3 n + m) time. The total space of T including the 
halfspace range-reporting data structures is O(n log^2 n) and T can be built in O(n log^3 n) 
expected time. 

4.2.1 Top-1 Queries 

Consider the top-1 query on the above query interval I. To answer the query, it suffices to 
find the plane in F(I) whose intersection with L(I) is the highest. To this end, it is sufficient 
to know the intersection of L(I) with the upper envelope of each canonical set of F(I). 
Therefore, for each canonical set S of T, instead of constructing a halfspace range-reporting 
data structure, we compute the upper envelope of the planes of S [20] and build a point 
location data structure [22, 27] on the upper envelope, which can be done in O(|S|) space 
and O(|S| log |S|) time. In this way, for each canonical set S of F(I), in O(log n) time we 
can determine the plane intersecting L(I) in the upper envelope of S. Hence, the top-1 query 
can be answered in O(log^3 n) time since F(I) has O(log^2 n) canonical sets. 

Compared with the original data structure in [3], since we spend O(|S|) space and 
O(|S| log |S|) time on each canonical set S of T, the entire data structure can be constructed 
in O(n log^2 n) space and O(n log^3 n) (deterministic) time. We summarize the result for the 
top-1 queries in the following theorem. 

Theorem 8. In the histogram case, we can build in O(n log^3 n) time an O(n log^2 n) size data 
structure on P that can answer each top-1 query with a bounded query interval in O(log^3 n) 
time. 

4.2.2 Top-k Queries 

Consider the top-k query on the query interval I. To answer the query, it suffices to find 
the k planes in F(I) whose intersections with L(I) are the highest. To this end, for each 
canonical set S of T, we build a t-highest plane data structure given by Afshani and Chan 
[2] on the planes of S in O(|S|) space and O(|S| log |S|) expected time, such that given any 
integer t and any query line L parallel to the z-axis, the t highest planes of S at L can 
be found in O(log |S| + t) time. Compared with the original data structure in [3], since we 
spend asymptotically the same space and time on each canonical set of T, our data structure 
can be constructed in O(n log^2 n) space and O(n log^3 n) expected time. 

To answer the top-$k$ query on $I$, one straightforward way works as follows. For each
canonical set $S$ of $F(I)$, by using the $t$-highest plane data structure with $t = k$, we compute
the highest $k$ planes of $S$ at $L(I)$. Since there are $O(\log^2 n)$ canonical sets in $F(I)$, the above
computes $O(k \log^2 n)$ planes, and among them the highest $k$ planes at $L(I)$ are the answer
to the top-$k$ query. The query time is $O(\log^2 n (\log n + k))$. In the following, we present an
improved query algorithm with time $O(\log^3 n + k)$.
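
The straightforward method can be sketched as follows. For brevity the sketch represents each plane only by its height on $L(I)$, and the hypothetical helper `t_highest` uses `heapq.nlargest` as a stand-in for the $t$-highest plane query; neither choice reproduces the stated time bounds.

```python
import heapq
from typing import List


def t_highest(heights: List[float], t: int) -> List[float]:
    """Stand-in for a t-highest plane query on one canonical set."""
    return heapq.nlargest(t, heights)


def topk_straightforward(canonical_sets: List[List[float]], k: int) -> List[float]:
    """Collect k candidates per canonical set, then keep the k highest overall."""
    candidates: List[float] = []
    for s in canonical_sets:
        candidates.extend(t_highest(s, k))  # up to k planes from every set
    return heapq.nlargest(k, candidates)
```

Every canonical set contributes up to $k$ candidates regardless of how few of them end up in the answer, which is the source of the $k \log^2 n$ term that the improved algorithm avoids.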

In the following discussion, for simplicity, whenever we refer to the relative order of the
planes (e.g., highest, lowest, higher, lower), we refer to their intersections with the line $L(I)$.
For example, by “a plane is higher than another plane”, we mean that the first plane has a
higher intersection with $L(I)$ than the second plane. For ease of exposition, we assume the
intersection points of the planes of $F(I)$ with $L(I)$ are distinct. Note that $F(I)$ is a family
of canonical sets; but by slightly abusing the notation, when we say “a plane of $F(I)$”, we
really mean that the plane is in a canonical set of $F(I)$.

We make use of some ideas from Lemma 10, although the details are quite different. Our
algorithm has two steps: a main algorithm and a post-processing algorithm. We discuss the
main algorithm first.

Let $f = |F(I)|$, and thus $f = O(\log^2 n)$. Let $S_1, S_2, \ldots, S_f$ be the canonical sets of $F(I)$.
For each canonical set $S_i$, let $S_i^j$ denote the set of the highest $2^{j-1} \cdot \log n$ planes of $S_i$ for
$j = 1, 2, \ldots$, and we let $S_i^j = \emptyset$ for $j = 0$. For each $S_i$, our main algorithm maintains a
subset $S'_i \subseteq S_i$ and an integer $j(i)$, such that $S'_i = S_i^{j(i)}$. We also maintain a max-heap
$H$ that contains the lowest plane in the subset $S'_i$ for each $S_i \in F(I)$. Hence, the size of
$H$ is $O(\log^2 n)$. The “keys” of the planes in $H$ are the $z$-coordinates of their intersections
with $L(I)$. In addition, our algorithm maintains an integer $r$, which is the size of a set $R$ of
planes. Before the main algorithm stops, $R = \bigcup_{i=1}^{f} S_i^{j(i)-1}$ (after the main algorithm stops, the
definition of $R$ is slightly different; see the details below). Note that our algorithm does not
maintain $R$ explicitly, and we use $R$ only to argue the correctness of the algorithm. During
the main algorithm, $r$ will increase, and the main algorithm stops once $r \ge k$ (at which
moment we have identified a set of $O(\log^3 n + k)$ planes, and among them the highest $k$
planes are the answer to our top-$k$ query, which will be found later by the post-processing
algorithm).

Initially, for each canonical set $S_i$ of $F(I)$, by using the $t$-highest plane data structure
with $t = \log n$, we compute $S'_i = S_i^1$, and further we find the lowest plane in $S'_i$ and insert
it into $H$; the above can be done in $O(\log n + |S'_i|)$ time, which is $O(|S'_i|)$ time due to
$|S'_i| = |S_i^1| = \log n$. Also, initially we set $r = 0$ ($R$ is implicitly set to $\emptyset$), and set $j(i) = 1$ for
each $i$ with $1 \le i \le f$.

Next, we do an “extract-max” operation on $H$ to find the highest plane in $H$ and remove
it from $H$. Suppose the above plane is from a canonical set $S_i$ for some $i$. Then, we let
$R = R \cup S'_i$ and set $r = r + |S'_i|$. Further, by using the $t$-highest plane data structure with
$t = 2 \log n$, we compute $S'_i = S_i^2$, and then we find the lowest plane in $S'_i$ and insert it into
$H$; again, the above can be done in $O(|S'_i|)$ time. Finally, we update $j(i) = 2$.

In general, we do an extract-max operation on the current $H$ and suppose the removed
plane is from a canonical set $S_i$ for some $i$. We let $R = R \cup S_i^{j(i)} \setminus S_i^{j(i)-1}$ (note that
$S_i^{j(i)-1} \subseteq S_i^{j(i)}$, and thus this just adds those planes of $S_i^{j(i)}$ that are not in $S_i^{j(i)-1}$ to $R$), and set
$r = r + |S_i^{j(i)}| - |S_i^{j(i)-1}|$. Again, we do not explicitly maintain $R$ but explicitly maintain $r$.
If $r \ge k$, then we stop the main algorithm. Otherwise, by using the $t$-highest plane data
structure with $t = 2^{j(i)} \cdot \log n$, we compute $S'_i = S_i^{j(i)+1}$ (and the previous $S'_i$ is discarded),
and further we find the lowest plane in $S_i^{j(i)+1}$ and insert it into $H$; again, the above can be
done in $O(|S'_i|)$ time. Note that for ease of exposition, we assume $|S_i| \ge 2^{j(i)} \cdot \log n$ (otherwise,
we can solve the problem by similar techniques with more tedious discussions). Finally, we
increase $j(i)$ by one.
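
The main algorithm can be sketched as below, under several simplifying assumptions that are not part of the text: each canonical set is given as a list of distinct heights on $L(I)$ already sorted from high to low, so that a list prefix stands in for a $t$-highest plane query; the hypothetical parameter `base` plays the role of $\log n$; and prefix sizes are capped at the set size, which is one possible way to handle the small sets that the exposition sets aside.

```python
import heapq
from typing import List, Tuple


def main_algorithm(sets_desc: List[List[float]], k: int, base: int) -> Tuple[List[int], int]:
    """Return (levels, r): levels[i] plays the role of j(i), r is |R| on exit."""

    def prefix_size(i: int, j: int) -> int:
        # |S_i^j| = 2^(j-1) * base, capped at |S_i|; S_i^0 is empty.
        return 0 if j == 0 else min(len(sets_desc[i]), base * 2 ** (j - 1))

    levels = [1] * len(sets_desc)
    r = 0
    # heapq is a min-heap, so keys are negated to simulate the max-heap H.
    heap = [(-s[prefix_size(i, 1) - 1], i) for i, s in enumerate(sets_desc) if s]
    heapq.heapify(heap)

    while heap:
        _, i = heapq.heappop(heap)                    # extract-max on H
        j = levels[i]
        r += prefix_size(i, j) - prefix_size(i, j - 1)  # R gains S_i^j minus S_i^(j-1)
        if r >= k:
            break                                     # j(i) is not increased on the last step
        if prefix_size(i, j) == len(sets_desc[i]):
            continue                                  # set exhausted (assumed handling)
        levels[i] = j + 1
        lowest = sets_desc[i][prefix_size(i, j + 1) - 1]
        heapq.heappush(heap, (-lowest, i))            # lowest plane of S_i^(j+1)
    return levels, r
```

The function only tracks the counter `r` and the levels, mirroring the fact that $R$ itself is never materialised during the main algorithm.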

The above finishes the main algorithm. After it stops, let $R' = \bigcup_{i=1}^{f} S_i^{j(i)}$. Let $B$ be the
set of $k$ highest planes in $F(I)$, i.e., $B$ is the answer to our top-$k$ query on $I$. We have the
following lemma.

Lemma 12. $R \subseteq R'$, $B \subseteq R'$, and $|R'| = O(\log^3 n + k)$.

Proof. We first show $R \subseteq R'$.

Suppose the last extract-max operation on $H$ in the main algorithm removes a plane from
the canonical set $S_i$ for $i = a$. Hence, the algorithm stops after we have $R = R \cup S_a^{j(a)} \setminus S_a^{j(a)-1}$.
Thus, $S_a^{j(a)} \subseteq R$. Consider any other canonical set $S_i$ with $i \ne a$. According to our algorithm,
it always holds that $S_i^{j(i)-1} \subseteq R$. Therefore, we have the following:

\[
R = S_a^{j(a)} \cup \bigcup_{1 \le i \le f,\; i \ne a} S_i^{j(i)-1} \quad \text{and} \quad R' = \bigcup_{1 \le i \le f} S_i^{j(i)}.
\]

Since $S_i^{j(i)-1} \subseteq S_i^{j(i)}$ for any $i$, we obtain that $R \subseteq R'$.

Next, we show that $|R'| = O(\log^3 n + k)$.

Indeed, since the algorithm stops right after $r = r + |S_a^{j(a)}| - |S_a^{j(a)-1}| \ge k$ and $R =
R \cup S_a^{j(a)} \setminus S_a^{j(a)-1}$, the original value of $r$ before the above increase is less than $k$. In
other words, $\sum_{1 \le i \le f} |S_i^{j(i)-1}| < k$. For each $S_i$ of $F(I)$, if $j(i) = 1$, then $|S_i^{j(i)}| = \log n$
and $|S_i^{j(i)-1}| = 0$; otherwise, $|S_i^{j(i)}| = 2|S_i^{j(i)-1}|$. Therefore, we obtain $|R'| = \sum_{1 \le i \le f} |S_i^{j(i)}| \le
2 \sum_{1 \le i \le f} |S_i^{j(i)-1}| + f \cdot \log n \le 2k + O(\log^3 n) = O(\log^3 n + k)$ because $f = |F(I)| = O(\log^2 n)$.

Finally, we prove $B \subseteq R'$.

Let $\sigma^*$ denote the plane removed by the last extract-max operation on $H$ in the main
algorithm. We claim that $\sigma^*$ is the lowest plane in $R$. We prove the claim below.

According to our algorithm, the planes removed by the extract-max operations on $H$
follow the order from high to low. Consider any plane $\sigma \in R$. To prove the claim, it is
sufficient to show that $\sigma^*$ is not higher than $\sigma$. According to our algorithm, the first time $\sigma$
is added to $R$ must be due to an operation on $R$: $R = R \cup S_i^{j(i)} \setminus S_i^{j(i)-1}$ after an extract-max
operation removes a plane $\sigma'$ from $H$ and $\sigma'$ is from a canonical set $S_i$. This implies that
$\sigma \in S_i^{j(i)} \setminus S_i^{j(i)-1}$. According to our algorithm, $\sigma'$ is the lowest plane in the above $S_i^{j(i)}$, and
thus, $\sigma'$ is not higher than $\sigma$. On the other hand, since $\sigma^*$ is the last plane removed by the
extract-max operations, $\sigma^*$ is not higher than $\sigma'$. Therefore, $\sigma^*$ is not higher than $\sigma$, and the
above claim is proved.

Consider any plane $\sigma \in B$. To show $B \subseteq R'$, it suffices to prove $\sigma \in R'$. If $\sigma$ is in $R$, then
since $R \subseteq R'$, $\sigma \in R'$ is true. Below we assume $\sigma \notin R$, and thus $\sigma \ne \sigma^*$.

Note that $\sigma \in B$ implies that there are at most $k - 1$ planes of $F(I)$ higher than $\sigma$. Since
$r = |R| \ge k$ and $\sigma^*$ is the lowest plane in $R$, $\sigma$ must be higher than $\sigma^*$ since otherwise all
planes in $R$ would be higher than $\sigma$, contradicting the fact that there are at most $k - 1$ planes
higher than $\sigma$.

Assume $\sigma$ is in a canonical set $S_i$ for some $i$. Recall that $\sigma^*$ is from the canonical set $S_a$.
Note that all planes of $S_a$ higher than $\sigma^*$ are in $S_a^{j(a)}$. By our definition of $R$, $S_a^{j(a)} \subseteq R$.
Since $\sigma$ is higher than $\sigma^*$ and $\sigma \notin R$, we can obtain $i \ne a$. According to our algorithm,
after the algorithm stops, $H$ contains a plane $\sigma_i$ from $S_i$, and $\sigma_i$ is the lowest plane in $S_i^{j(i)}$.
Recall that $S_i^{j(i)} \subseteq R'$. This implies that all planes of $S_i$ higher than $\sigma_i$ are in $R'$. Since $\sigma^*$
is removed by an extract-max operation and after the operation $\sigma_i$ is still in $H$, $\sigma^*$ must be
higher than $\sigma_i$. Because $\sigma$ is higher than $\sigma^*$, $\sigma$ is higher than $\sigma_i$.

In summary, the above discussion obtains the following: $\sigma$ is in $S_i$; $\sigma$ is higher than $\sigma_i$;
all planes of $S_i$ higher than $\sigma_i$ are in $R'$. Thus, we obtain that $\sigma$ is in $R'$.

Therefore, we conclude that $B \subseteq R'$, and the lemma follows. □

Based on Lemma 12, if we have the set $R'$ explicitly, then we can compute $B$ in additional
$O(\log^3 n + k)$ time by using the linear time selection algorithm [15]. However, the above main
algorithm does not explicitly compute $R'$, but it has maintained $j(i)$ for each $S_i \in F(I)$. Since
$R' = \bigcup_{1 \le i \le f} S_i^{j(i)}$, we can compute $R'$ by using a $t$-highest plane query with $t = 2^{j(i)-1} \cdot \log n$
on each canonical set $S_i$ of $F(I)$.
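
A minimal sketch of this post-processing step follows, assuming the levels $j(i)$ are already available as a list `levels`, each canonical set is a list of heights sorted from high to low (so a prefix stands in for the $t$-highest plane query), and the hypothetical parameter `base` plays the role of $\log n$. For brevity it uses `heapq.nlargest` rather than a linear time selection algorithm, so it returns the answer in sorted order and does not match the stated bound.

```python
import heapq
from typing import List


def post_process(sets_desc: List[List[float]], levels: List[int], k: int, base: int) -> List[float]:
    """Materialise R' from the levels j(i), then keep its k highest planes."""
    r_prime: List[float] = []
    for s, j in zip(sets_desc, levels):
        t = base * 2 ** (j - 1)   # t = 2^(j(i)-1) * log n
        r_prime.extend(s[:t])     # stand-in for the t-highest plane query on S_i
    # nlargest is a convenience substitute for linear-time selection.
    return heapq.nlargest(k, r_prime)
```

For instance, with `base = 2`, levels `[2, 2, 1]` and the three sets used in the test values below, $R'$ has ten elements and the five highest are returned.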

The following lemma gives the running time of our entire top-$k$ query algorithm.

Lemma 13. The time complexity of our top-$k$ query algorithm is $O(\log^3 n + k)$.

Proof. We first analyze the main algorithm, whose running time mainly depends on the time
of the $t$-highest plane queries and the time of the operations on the heap $H$. We first give a
bound on the time of the $t$-highest plane queries, with the help of Lemma 12.

Note that after each $t$-highest plane query in the main algorithm, we always find the
lowest plane in the output planes of the query, whose time is only linear in the number of
output planes and is upper bounded by the above query time. In the following, we focus on
analyzing the time of the $t$-highest plane queries.

Consider any canonical set $S_i \in F(I)$. According to our algorithm, for each $j$ with
$1 \le j \le j(i)$, the main algorithm performs a $t$-highest plane query on $S_i$ with $t = 2^{j-1} \cdot \log n$
to compute $S_i^j$, which takes $O(|S_i^j|)$ time (we ignore the additive $\log n$ term in the query time because
$\log n \le |S_i^j|$). Hence, the total time of the $t$-highest plane queries on $S_i$ in the main algorithm
is $O(\sum_{j=1}^{j(i)} |S_i^j|)$. Note that

\[
\sum_{j=1}^{j(i)} |S_i^j| = \log n \cdot \sum_{j=1}^{j(i)} 2^{j-1} \le \log n \cdot 2^{j(i)} = 2 \cdot |S_i^{j(i)}|.
\]

Recall that $|R'| = \sum_{1 \le i \le f} |S_i^{j(i)}|$. Hence, the total time on the $t$-highest plane queries in
the entire main algorithm is $O(|R'|)$, which is $O(\log^3 n + k)$ by Lemma 12.
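
The inequality used above rests on a geometric sum; the intermediate step, written out under the same definition $|S_i^j| = 2^{j-1} \cdot \log n$, is:

```latex
\[
\sum_{j=1}^{j(i)} 2^{j-1} \;=\; 2^{j(i)} - 1 \;<\; 2^{j(i)} \;=\; 2 \cdot 2^{j(i)-1},
\qquad\text{so}\qquad
\sum_{j=1}^{j(i)} |S_i^j| \;<\; 2 \cdot 2^{j(i)-1} \cdot \log n \;=\; 2\,|S_i^{j(i)}| .
\]
```

In words, the doubling schedule makes the last query on each canonical set dominate the cost of all earlier queries on that set.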

Next, we analyze the time we spent on the heap $H$. Recall that the size of $H$ is $O(\log^2 n)$.
Initially, we build $H$ on $O(\log^2 n)$ planes, which can be done in $O(\log^2 n)$ time. Later in the
algorithm, the operations on $H$ include the extract-max and insertion operations. We need
to figure out how many operations were performed on $H$ in the main algorithm.

Consider any extract-max operation on $H$, and suppose the removed plane is from set $S_i$.
Then, after the operation, we have $R = R \cup S_i^{j(i)} \setminus S_i^{j(i)-1}$, and since $|S_i^{j(i)}| - |S_i^{j(i)-1}| \ge \log n$,
the above increases $R$ by at least $\log n$ planes. After that, there is at most one insertion
operation on $H$. Since the main algorithm stops once $|R| \ge k$, the total number of extract-max
operations is at most $\lceil k / \log n \rceil$. The number of insertion operations is also at most $\lceil k / \log n \rceil$. Since
$|H| = O(\log^2 n)$, each operation on $H$ takes $O(\log\log n)$ time. The total time on $H$ is
$O(\log^2 n + \lceil k / \log n \rceil \cdot \log\log n) = O(\log^2 n + k)$.
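
The last simplification can be spelled out as follows, assuming the number of extract-max operations is at most $\lceil k / \log n \rceil$ (which follows from each operation adding at least $\log n$ planes to $R$) and using $\log\log n \le \log n$:

```latex
\[
\left\lceil \frac{k}{\log n} \right\rceil \cdot \log\log n
\;\le\; \left( \frac{k}{\log n} + 1 \right) \cdot \log\log n
\;\le\; k + \log\log n ,
\]
```

so the heap operations after the initial build cost $O(k + \log\log n)$, which is absorbed into $O(\log^2 n + k)$.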

Therefore, the total time of the main algorithm is $O(\log^3 n + k)$.

Finally, we analyze the running time of the post-processing step, which computes $R'$ and
finds the highest $k$ planes in $R'$. Computing $R'$ is done by doing a $t$-highest plane query
with $t = 2^{j(i)-1} \cdot \log n$ on each set $S_i$. Therefore, as above, the total time is $O(|R'|)$, which is
$O(\log^3 n + k)$. Finding the highest $k$ planes in $R'$ takes $O(|R'|)$ time by using the linear time
selection algorithm [13].

Thus, the total time of our top-$k$ query algorithm is $O(\log^3 n + k)$. □

The above discussion leads to the following theorem. 

Theorem 9. We can build in $O(n \log^3 n)$ expected time an $O(n \log^2 n)$ size data structure
on $P$ that can answer each top-$k$ query with a bounded query interval in $O(\log^3 n + k)$ time.

Note that the planes reported by our top-$k$ query algorithm are not in any sorted order.

5 Conclusions


In this paper we present a number of data structures for answering a variety of range queries
over uncertain data in one-dimensional space. In general, our data structures have linear or
nearly linear sizes and can support efficient queries. While it would be interesting to develop
better solutions, an interesting but challenging open problem is whether we can generalize
our techniques to solve the corresponding problems in higher dimensions, for which only
heuristic results have been proposed [43,44].